Cross-species nucleic acid probes

ABSTRACT

The present invention features a collection of at least four nucleic acid probes. Each probe comprises a segment, the entirety of which hybridizes under low stringency conditions to at least a first gene of a first species and a second gene of a second species, wherein the hybridizing probes correspond to different genes of the first or second species, and the first and the second genes are orthologous to each other. In some embodiments, the entirety of the segment is at least 60% (e.g., 65%, 70%, 80%, 90%, or 95%) identical to the first gene and the second gene.

RELATED APPLICATIONS

This application claims priority to U.S. Provisional Application Ser. No. 60/357,541, filed on Feb. 15, 2002, and U.S. application Ser. No. 10/366,823, filed on Feb. 14, 2003, the contents of which are incorporated herein by reference.

BACKGROUND

Nucleic acid arrays are ordered collections of nucleic acid probes. Specific hybridization of nucleic acids in a sample to the probes can be used to interrogate the composition and abundance of nucleic acids in the sample. There are numerous applications of nucleic acid arrays, including analysis of gene expression and of gene polymorphisms.

The nucleic acid sequences of a number of genomes are now available. These genomes include the genomes of humans, model eukaryotic organisms (e.g., Drosophila melanogaster, Caenorhabditis elegans, and Saccharomyces cerevisiae) and bacterial pathogens. A number of genes have been identified that are conserved among metazoans: for example, genes encoding homeobox domains, kinases, and biosynthetic enzymes. Conserved genes and conserved proteins can be identified by aligning amino acid or nucleotide sequences, typically using a computer program. Such an analysis can be extended to the genomic scale. Some exemplary studies are described in Peltonen and McKusick (2001) Science 291: 1224-1229; O'Brien et al. (1999) Science 286: 458-481; and Rubin et al. (2000) Science 287: 2204-2215. Comparative genomics facilitates the elucidation of the evolutionary basis of cellular and developmental processes in different phyla.

SUMMARY

This invention relates to nucleic acid probes that can be used to analyze samples from a wide spectrum of species. The probes are designed for cross-species hybridization, e.g., by identification of conserved segments among orthologous genes (also termed “orthologs”). The probes can be attached to an array that can be used for comparative gene analyses.

In one aspect, this invention features a collection of at least four (e.g., at least 5, 50, 100, 500, or 1000, or any number between 4 and 10⁴, 10⁵, or 10⁶, or any number beyond 10⁶) nucleic acid probes. The probes each include a segment, the entirety of which hybridizes under low stringency conditions to at least a first gene of a first species and a second gene of a second species. The hybridizing probes correspond to different genes of the first or second species. In each case, the first and the second genes are orthologous to each other. Sequences other than the above-described probes, if any (e.g., sequences necessary for experimental controls or for other purposes), are not considered probes. In some embodiments, the entirety of the segment is at least 60% (e.g., 65%, 70%, 80%, 90%, or 95%) identical to the first gene and the second gene. In no case is the segment less than 20 nucleotides.

The probes (including DNA and RNA) are nucleic acids that are entirely single stranded or partially single stranded. They can be prepared by a chemical synthesis method, or prepared by a polymerase chain reaction. A probe can be between 20 and 2000, 20 and 800, or 20 and 500 nucleotides in length. Each probe can be attached to a solid support, e.g., the same solid support or different solid supports. Alternatively, each probe can be free in solution. For example, each probe can be labeled or tagged. Optionally, each probe can be immobilized, e.g., after hybridization. In some embodiments, the probes each include a consensus sequence, or one or more degenerate positions. A “consensus” sequence is a sequence derived from a profile of two or more related sequences. For example, at positions that differ, the most common nucleotide can be included in the consensus. Alternatively, the position can be made degenerate. With respect to a population of probes, a “degenerate” position refers to a position that includes either different nucleotides among the population or an atypical nucleotide (i.e., not adenine, guanine, cytosine, uracil, or thymine) such as inosine.

The term “segment,” as used herein, refers to a nucleic acid that is conserved among orthologs. The segment is at least 20 nucleotides in length (e.g., 20 to 600, or 20 to 200 nucleotides). A “DNA” refers to the polymeric form of deoxyribonucleotides (adenine, guanine, thymine, or cytosine) in either single-stranded form or as a double-stranded helix. An “RNA” refers to the polymeric form of ribonucleotides (adenine, guanine, uracil, or cytosine) in either single-stranded form or as a double-stranded helix.

In some embodiments, the probe includes a nucleic acid segment of a human disease gene. The gene can encode, for example, an enzyme, a transcription factor, a cell surface protein, or a functional domain. See, e.g., the “Probe Selection” section.

The term “species” refers to a naturally existing population of similar organisms that are given a unique name to distinguish them from all other creatures. A species may include various “serotypes” and “strains,” i.e., the classification of the sub-species, or the descendants of a common species. The first and the second species can be of different genera, different families, different orders, different classes, different phyla, or different kingdoms. The first and the second species can also be different species of the same genus, the same family, the same order, the same class, the same phylum, or the same kingdom. For example, the first and second species might be of different phyla, but the same kingdom, and so forth. The first and the second species can be, independently, a mammal (e.g., a human or a mouse), an invertebrate (e.g., Drosophila or C. elegans), a fungus (e.g., S. cerevisiae), a bacterium (e.g., B. subtilis, E. coli, or P. aeruginosa), or a virus (e.g., a virus that infects mammalian cells).

In some embodiments, the first species can be of a different class from the second species, or of a different phylum from the second species. The first and the second species can be, independently, a mammal, an invertebrate, a plant, a fungus, a bacterium, or a virus. Some species are microorganisms; others are multicellular. In one exemplary combination, the first species is a mammal (e.g., human), and the second species is an invertebrate (e.g., Drosophila). Additional examples of species include: Arabidopsis thaliana, Caenorhabditis elegans, Danio rerio, Dictyostelium discoideum, Escherichia coli, Takifugu rubripes, Hepatitis C virus, Mus musculus, Mycoplasma, Oryza sativa, Zea mays, Plasmodium falciparum, Pneumocystis carinii, Rattus, Saccharomyces cerevisiae, Schizosaccharomyces pombe, and Xenopus laevis.

In another aspect, this invention features a method for providing a collection of probes. The method includes, for each of a plurality of genes of a first species, identifying a segment that is conserved relative to an orthologous gene from a second species; selecting a sequence from the orthologous gene that is at least 60% (e.g., 65%, 70%, 80%, 90%, or 95%) identical to the conserved segment; and preparing a probe having the selected sequence or a reverse complement of the selected sequence, thereby producing a collection of probes. In some embodiments, the selected sequence is identical to the conserved segment, or identical to a segment of the orthologous gene. In some other embodiments, the selected sequence is a consensus sequence.
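As an illustration only, the final step of the just-described method (preparing a probe having the selected sequence or its reverse complement) can be sketched as follows. The function name `reverse_complement` and the example sequence are hypothetical and are not taken from this specification; the sketch assumes an unambiguous DNA sequence written with the letters A, C, G, and T.

```python
# Minimal sketch (assumption: plain DNA alphabet A/C/G/T, upper or lower case).
# Maps each base to its Watson-Crick partner.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    # Complement every base, then reverse the string.
    return seq.translate(_COMPLEMENT)[::-1]


# Hypothetical selected sequence; either strand could serve as the probe.
selected = "ATGCCGTA"
probe_options = (selected, reverse_complement(selected))
```

Either member of `probe_options` corresponds to a probe "having the selected sequence or a reverse complement of the selected sequence."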

This invention also features a collection of probes provided by the just-described method.

In still another aspect, this invention features a method for evaluating a sample. The method includes contacting a sample that comprises nucleic acids from a third species to a collection of nucleic acid probes, wherein (i) each probe comprises a segment, the entirety of which cross-hybridizes under low stringency conditions to at least a first gene of a first species and a second gene of a second species, wherein the collection includes probes corresponding to different genes of the first or second species, and the first and the second genes are orthologous to each other, and (ii) the third species differs from both the first and the second species; evaluating binding of the sample to each of the probes; and for each probe that is bound, inferring the presence, in the sample, of a nucleic acid for a third gene of the third species, the third gene being an ortholog of the first and second gene. In some embodiments, the third species is of a different order from either the first or the second species, or of a different order from both the first and the second species. The nucleic acids from the third species can be genomic DNA, mRNA, or reverse-transcribed mRNA.

In yet another aspect, this invention features a substrate that includes a plurality of addresses (e.g., at least 50 addresses, but less than 10 000, at least 300 addresses, but less than 5 000, or at least 500 addresses, but less than 2 000), each address comprising a nucleic acid probe, each probe comprising a segment, the entirety of which cross-hybridizes under low stringency conditions to at least a first gene of a first species and a second gene of a second species, wherein the collection includes probes corresponding to different genes of the first or second species, and the first and the second genes are orthologous to each other. The nucleic acid segment can be between 15 and 600 nucleotides in length, and the first species can be of a different order from the second species.

The probes can be covalently or non-covalently attached to the substrate. The substrate can be a bead, and each expressed sequence can be attached to a different bead. The substrate can be an array, e.g., a glass (e.g., a surface-modified glass), a membrane, or a polymer, and each expressed sequence can be attached to a different address of the same array. The number of the different addresses can be at least 50, but less than 10 000, at least 300, but less than 5 000, or at least 500, but less than 2 000.

Also within the scope of this invention is a collection of nucleic acid probe sets. Each probe set includes a first and second probe. The first probe includes a first nucleic acid sequence identical to a segment of a first gene of a first species. The second probe includes a second nucleic acid sequence identical to a segment of a second gene of a second species and non-identical to the segment of the first gene. Moreover, the first and second genes are orthologs. Each probe set can further include at least a third probe that has a third nucleic acid sequence identical to a segment of a third gene from a third species. The collection can include a plurality of at least 20, 50, 100, 500 or 1000 probe sets. In one embodiment, the collection includes no more than 20 000 or 5 000 probe sets.

In another aspect, this invention features a nucleic acid array substrate that includes a plurality of address sets. Each address set includes at least a first and a second address. The first address has a first nucleic acid probe and the second address has a second nucleic acid probe. The first probe includes a first nucleic acid sequence identical to a segment of a first gene of a first species. The second probe includes a second nucleic acid sequence identical to a segment of a second gene of a second species and non-identical to the segment of the first gene. The first and second genes are orthologs.

The term “address set” merely means a group of addresses (e.g., a group of at least two addresses) that are related by the identity of the probe at each address. The term does not imply a spatial or other physical relationship. In one embodiment, each address of an address set is adjacent to at least another address of the set. In another embodiment, however, each address of the set can be located in a different region of the substrate. For example, the substrate can include at least a first and second non-overlapping region. Each first address of a set is located within the first region and each second address of the set is located within the second region.

In yet another aspect, the present invention features a method that includes, for each of a plurality of expressed sequences of a first species, identifying an orthologous expressed sequence in a second species; selecting a nucleic acid segment of the orthologous expressed sequence; preparing the nucleic acid segment; and attaching the nucleic acid segment to a substrate. The first species can be of a different order from the second species. The nucleic acid segment is at least 60% identical between the first and the second species, and is between 15 and 600 nucleotides in length.

In still another aspect, this invention features a method for evaluating a sample. The method includes providing an array; contacting a sample that comprises nucleic acids from a third species to the array, in which the third species differs from both the first and the second species; and evaluating binding of the sample to the array. The array includes a substrate having a first plurality of addresses, each address including a first unique probe that comprises a first nucleic acid segment of a first species. The first nucleic acid segment is at least 60% identical to a gene of the first species and to its ortholog in a second species. The substrate also has a second plurality of addresses, each of which corresponds to each of the first plurality of addresses, and includes a second unique probe that comprises a second nucleic acid segment of the second species. Each of the first and the second nucleic acid segments is between 15 and 600 nucleotides in length, and the first species is of a different order from the second species.

The “percent identity” of two amino acid sequences or of two nucleic acids is determined using the algorithm of Karlin and Altschul (1990) Proc. Natl. Acad. Sci. USA 87: 2264-68, modified as in Karlin and Altschul (1993) Proc. Natl. Acad. Sci. USA 90: 5873-77. Such an algorithm is incorporated into the NBLAST and XBLAST programs (version 2.0) of Altschul et al. (1990) J. Mol. Biol. 215: 403-10. BLAST nucleotide searches can be performed with the NBLAST program, score=100, wordlength=12 to obtain nucleotide sequences homologous to the nucleic acid molecules of the invention. BLAST protein searches can be performed with the XBLAST program, score=50, wordlength=3 to obtain amino acid sequences homologous to the protein molecules of the invention. Where gaps exist between two sequences, Gapped BLAST can be utilized as described in Altschul et al. (1997) Nucleic Acids Res. 25(17): 3389-3402. When utilizing BLAST and Gapped BLAST programs, the default parameters of the respective programs (e.g., XBLAST and NBLAST) can be used. See the online site provided by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH), Bethesda, Md.
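For orientation only, the arithmetic behind a percent identity value can be sketched for two sequences that have already been aligned. This is a simplified illustration and not the Karlin and Altschul algorithm or any BLAST program; the function name `percent_identity`, the gap character "-", and the choice to count every aligned column (including columns with a gap in one sequence) in the denominator are all assumptions made for the sketch.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of aligned columns in which both sequences carry the same residue.

    Assumption: the two strings come from a pairwise alignment, so they have
    equal length and use `gap` to mark insertions or deletions.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap and y == gap:
            continue  # column that is empty in both sequences carries no information
        compared += 1
        if x == y:
            matches += 1  # identical residue (x cannot be a gap here)
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Hypothetical 10-nucleotide stretch with one mismatch.
identity = percent_identity("ACGTACGTAC", "ACGTTCGTAC")
```

Under these assumptions, a segment would meet the "at least 60% identical" criterion when the returned value is 60.0 or higher.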

As used herein, the term “hybridizes under low stringency conditions” describes conditions for hybridization and washing. Guidance for performing hybridization reactions can be found in Current Protocols in Molecular Biology, John Wiley & Sons, N.Y. (1989), 6.3.1-6.3.6, which is incorporated by reference. Aqueous and nonaqueous methods are described in that reference and either can be used. Low stringency hybridization conditions referred to herein are as follows: in 6× sodium chloride/sodium citrate (SSC) at about 45° C., followed by two washes in 0.2×SSC, 0.1% SDS at least at 50° C. (the temperature of the washes can be increased to 55° C. for increased stringency conditions). Under the low stringency hybridization conditions, the probes including a consensus sequence, or one or more degenerate positions, can hybridize to a first gene of a first species and a second gene of a second species, as described above.

Other exemplary hybridization conditions include: (i) medium stringency hybridization conditions in 6×SSC at about 45° C., followed by one or more washes in 0.2×SSC, 0.1% SDS at 60° C.; (ii) high stringency hybridization conditions in 6×SSC at about 45° C., followed by one or more washes in 0.2×SSC, 0.1% SDS at 65° C.; and (iii) very high stringency hybridization conditions in 0.5 M sodium phosphate, 7% SDS at 65° C., followed by one or more washes in 0.2×SSC, 1% SDS at 65° C. Of course, consensus probes and degenerate probes may also hybridize under these more stringent conditions.

Further, the stringency conditions for hybridization and washing can also be altered by the following factors. For the hybridization conditions: (I) in the presence of formamide in a hybridization buffer, (i) the higher the formamide concentration (ranging from 25 to 50%), the lower the stringency, (ii) the higher the hybridization temperature (37° C. to 45° C.), the higher the stringency, and (iii) the higher the SSC concentration (3× to 6×), the lower the stringency; and (II) in the absence of formamide in a hybridization buffer, the higher the hybridization temperature (ranging from 50° C. to 65° C.), the higher the stringency. For the washing conditions: (i) the higher the washing temperature (25° C. to 65° C.), the higher the stringency, and (ii) the higher the SSC concentration (0.1× to 2×), the lower the stringency.

The arrays can be used for a wide variety of applications. One application is a cross-species study, especially for those organisms that do not have a comparable amount of sequence information to construct a species-specific microarray. A cross-species array has the further advantage of economy, as a cross-species array can be used for more than one species. For example, a cross-species array designed from Drosophila and human sequences can be used to analyze nucleic acid from rodents, fish, snakes, and crustaceans (such as lobster). This approach can also be applied to design cross-species arrays appropriate for plants or microorganisms, e.g., fungi or bacteria. In one implementation, the cross-species array is designed from sequences from extremely divergent species (e.g., species of different kingdoms). The degree of divergence can be tailored according to the desired application.

Some other applications include the analysis of samples to diagnose infectious diseases caused by related pathogens (e.g., the array is a cross-species array designed from microbial sequences), and the analysis of samples to implement plant and animal quarantines. Other exemplary practical applications of cross-species analyses include: (i) supplying animal models for human genetic diseases; (ii) identifying candidate genes that participate in conserved disease processes; (iii) assessing multi-factorial genetic traits; (iv) identifying adaptations in non-human mammal species that ameliorate maladies homologous to human hereditary and infectious diseases; and (v) developing treatments for veterinary pathologies based on human trials.

In accordance with the present invention there may be employed conventional molecular biology, microbiology, and recombinant DNA techniques within the skill of the art. Such techniques, as well as technical terms, are explained fully in the literature. See, e.g., Ausubel, R. M., ed. (1994) “Current Protocols in Molecular Biology” Volumes I-III; Celis, J. E. ed. (1994) “Cell Biology: A Laboratory Handbook” Volumes I-III; Gait, M. J. ed. (1984) “Oligonucleotide Synthesis”; and Hames, B. D. & Higgins, S. J. eds. (1985) “Nucleic Acid Hybridization.”

This invention also features a collection that includes at least four nucleic acid probes (e.g., at least 5, 50, 100, 500, or 1000, or any number between 4 and 10⁴, 10⁵, or 10⁶, or any number beyond 10⁶), the probes each including a segment, the entirety of which hybridizes under low stringency conditions to at least a first gene of a first sub-species and a second gene of a second sub-species, wherein the hybridizing probes correspond to different genes of the first or second sub-species, and the first and the second genes are orthologous to each other. The first and second sub-species can be two strains or two serotypes. In some embodiments, at least one of the probes comprises at least one degenerate position.

In a further aspect, this invention features a method for evaluating a sample. The method includes contacting a sample that comprises nucleic acids from a third sub-species to a collection of nucleic acid probes, wherein (i) each probe comprises a segment, the entirety of which hybridizes under low stringency conditions to at least a first gene of a first sub-species and a second gene of a second sub-species, wherein the hybridizing probes correspond to different genes of the first or second sub-species, and the first and the second genes are orthologous to each other, and (ii) the third species is the same or different from the first or the second species; evaluating binding of the sample to each of the probes; and for each probe that is bound, inferring the presence, in the sample, of a nucleic acid for a third gene of the third sub-species, the third gene being an ortholog of the first and second gene.

Also within the scope of this invention is a packaged product. The packaged product includes a container, one of the aforementioned collections in the container, and a legend (e.g., a label or an insert) associated with the container and indicating use of the collection for identifying orthologous genes among different species or sub-species.

Other features, objects, and advantages of the invention will be apparent from the description and from the claims.

DETAILED DESCRIPTION

The present invention relates to a collection of probes that are designed to specifically recognize related nucleic acids from a plurality of species. In one typical implementation, the probes are attached to a planar array. The probe collections are designed by first identifying sequences from a first species, identifying sequence orthologs in at least a second species, and then constructing probes.

Probe Selection

To design a collection of probes, sequences of interest from a first species are identified. These sequences can be chosen with respect to the application of interest. For example, to identify pathogens, sequences correlated or associated with pathogenesis can be selected.

In one example, the comparative analysis focuses on human disease genes and their orthologs in other species. A “human disease gene,” herein, refers to a human gene which is naturally polymorphic, and for which one polymorphic allele is correlated with a diagnosable disorder or phenotype. This example of comparative analysis can provide insight into the mechanism and function of genes associated with diseases. The polymorphic allele can include a mutation, insertion (e.g., trinucleotide repeat expansion), deletion, loss of heterozygosity, or amplification relative to a normal allele, e.g., an allele not correlated or anti-correlated with the diagnosable disorder or phenotype.

Exemplary human diseases for which human disease genes have been identified include cancer, neurological disorders, and endocrine diseases. Human disease genes that are associated with cancer include menin (MEN; multiple endocrine neoplasia type 1), Peutz-Jeghers disease (STK11), ataxia telangiectasia (ATM), multiple exostosis type 2 (EXT2), a second bCL2 family member, a second retinoblastoma family member, and genes encoding p53-like proteins. Human disease genes that are associated with neurological disorders include tau (frontotemporal dementia with Parkinsonism), the Best macular dystrophy gene, neuroserpin (familial encephalopathy), genes for limb girdle muscular dystrophy types 2A and 2B, the Friedreich ataxia gene, the gene for Miller-Dieker lissencephaly, parkin (juvenile Parkinson's disease), and the Tay-Sachs and Stargardt's disease genes. Orthologs of many of these genes are present in Drosophila (see, e.g., Rubin et al. (2000) Science 287: 2204-2215).

A human disease gene can encode a polypeptide that includes an enzyme, a transcription factor, a cell surface protein, or a functional domain. A “functional domain” includes a polypeptide fragment that can independently participate in an interaction, e.g., an intramolecular or an intermolecular interaction. An intermolecular interaction can be a specific binding interaction or an enzymatic interaction (e.g., the interaction can be transient and a covalent bond is formed or broken).

An analysis can also be performed based on genes of different species to provide insight into the evolutionary basis of cellular and developmental processes. A probe collection of this invention can include a set of probes relating to genes associated with processes including cell division, cell shape, signaling pathways, cell-cell and cell-substrate adhesion, and apoptosis, which determine the developmental outcomes of different embryos. The processes can also include cell-cell interactions, cell polarity, and cell movement, which determine embryonic gradients, as well as the processes of neuronal signaling and innate immunity. Examples of cell cycle related genes include cyclin A (CycA), CycB, CycB3, CycE, and CycD. Other conserved cyclins that are associated with transcription include CycC, CycH, CycK, and CycT. Examples of orthologs related to the cytoskeleton include the tubulin superfamily, such as α-, β-, γ-, δ-, and ε-tubulin, which have been identified in both human and Drosophila. See Rubin et al. (2000) Science 287: 2204-2215.

In another example, the collection of probes is formulated from genes associated with bacterial pathogenesis. The first and second species can be gram-negative and/or gram-positive bacteria for which genes associated with pathogenesis have been identified. Probes to these genes can be used to analyze the pathogenicity of a sample that potentially includes a pathogenesis-associated gene of a third species, different from the first and second species.

Orthologs

Once a group of genes or proteins of interest has been identified from a first or “source” species, an ortholog is identified for each gene or protein from at least a second species. Orthologs are biopolymeric sequences (e.g., nucleic acid or polypeptide sequences) that are found in different species, yet have sequence similarity, and are predicted by a skilled artisan to perform similar functions. An “ortholog” is distinguished from a “homologue” as follows. An ortholog of a first sequence of a first species is the sequence of a second species with the most homology to the first sequence relative to other available sequences of the second species. For example, some sequences are members of large protein families. Even within the same species, a particular sequence may have multiple homologs. However, with respect to a second species, the particular sequence has a distinct ortholog, which is the most homologous sequence in the second species. Orthologs can be assigned by comparing numerous sequences to identify the best match-up. See, e.g., Science Oct. 24, 1997; 278(5338): 631-7 and Nucleic Acids Res Jan 1, 2001; 29(1): 22-28 for some exemplary methods and resources for assigning orthologs based on complete genome coverage.
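The distinction drawn above, in which the ortholog is the single most homologous sequence of the second species, can be sketched as a best-hit selection over pairwise comparison scores. The sketch below is illustrative only: the function name `assign_orthologs`, the gene identifiers, and the numeric scores are hypothetical, and the scores are assumed to come from some prior sequence comparison in which a higher score means greater homology.

```python
from typing import Dict, Iterable, Tuple


def assign_orthologs(hits: Iterable[Tuple[str, str, float]]) -> Dict[str, str]:
    """For each first-species sequence, keep the best-scoring second-species hit.

    `hits` holds (first_species_id, second_species_id, score) triples.
    Assumption: higher score means more homologous; ties keep the first hit seen.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    # Drop the scores and return only the query -> ortholog mapping.
    return {query: subject for query, (subject, _) in best.items()}


# Hypothetical comparison results: geneA has two homologs in the second
# species, but only the higher-scoring one is treated as its ortholog.
example_hits = [
    ("geneA", "sp2_gene1", 250.0),
    ("geneA", "sp2_gene7", 410.0),
    ("geneB", "sp2_gene3", 120.0),
]
orthologs = assign_orthologs(example_hits)
```

In this example `sp2_gene1` is a homolog of `geneA` but is not assigned as its ortholog, because `sp2_gene7` is the better match.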

The second species from which the orthologs are identified can be judiciously chosen in accordance with the desired application. The second species can be as divergent or related to the first species as desired. For example, the second species can be from the same kingdom, phylum, or class. The second species can also be from a different order, class, phylum, or kingdom. The National Center for Biotechnology Information (NCBI; Bethesda Md.) also provides a taxonomy on-line resource that can be used to determine the phylogenetic relationship between two species (Wheeler et al. (2000) Nucleic Acids Res. 28: 10-4). The site can be used to identify the full taxonomic classification of a species.

For both nucleic acid and polypeptide sequences, orthologs can be identified by a sequence comparison search. To identify an ortholog of a particular sequence of a first species, the first sequence is iteratively compared with each available sequence of the second species. It is, of course, useful, but not necessary, to have sequence information for the entire genome of the second species. The sequence can be complementary to the coding strand or to the non-coding strand of a gene.

Information for nucleic acid and protein sequences of a species can be retrieved from publicly available databases. Such databases include, but are not limited to, Online Mendelian Inheritance in Man (OMIM), the Cancer Genome Anatomy Project (CGAP), GenBank, EMBL, PIR, SWISS-PROT, and the like. These databases can be accessed from their on-line facilities using uniform resource locators that are well known to those skilled in the art. Some of these databases contain complete or partial nucleotide sequences for a particular species. In addition, for some species, a large fraction of the genome, i.e., close to the entire genome, is available.

The comparisons can be computed using a computer program such as BLAST. After candidate orthologs are identified, further refinements can be used to identify the ortholog. For example, the sequence match search program, e.g., from the EMBOSS suite of programs (available from UK Medical Research Council HGMP Resource Centre, Hinxton, Cambridge, CB10 1SB, United Kingdom), can be used to plot a dot matrix figure of the particular sequence and its ortholog. Based on the match density at a given location, there may be no dots, isolated dots, or a set of dots so close together that they appear as a line. The presence of lines indicates sequence homology.

Potential orthologs for polypeptides and nucleic acids can be further analyzed to select representative sequences that fit criteria for being conserved sequences. The criteria can be based on cut-off values, referred to as E-values. The lower the E-value, the better the match. The E-value can be defined depending upon the stringency, or degree of homology desired, as described above. Orthologs can be identified on the basis of nucleic acid or amino acid sequence homology. Typically, nucleic acid sequence homology is used, as this homology reflects the utility of nucleic acid probes in hybridization.

Once an ortholog is identified for a particular sequence, both sequences are scanned to identify a conserved segment of about 15 to 300 nucleotides. The length and sequence composition of the segment can be selected such that the sequence has a desired degree of sequence conservation (e.g., E-value or % identity), Tm, specificity, composition, and length. For a particular collection of probes, these parameters can be defined with specific boundaries in order to ensure homogeneity in the probe behavior within the collection.
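One possible way to scan a pair of aligned orthologous sequences for conserved segments is a fixed-length sliding window, sketched below. This is a simplified illustration under stated assumptions rather than the procedure of this specification: the function name `find_conserved_segments`, the default window of 20 nucleotides, and the use of percent identity as the only criterion (ignoring Tm, specificity, and composition) are choices made for the sketch, and the two input strings are assumed to be aligned and of equal length.

```python
from typing import List, Tuple


def find_conserved_segments(
    aligned_a: str,
    aligned_b: str,
    window: int = 20,
    min_identity: float = 60.0,
) -> List[Tuple[int, float]]:
    """Return (start, percent_identity) for every window meeting the threshold.

    Assumption: inputs are equal-length aligned sequences; a gap ('-') in
    either sequence counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # 1 where the column is an identical, non-gap residue; 0 otherwise.
    match = [int(x == y and x != "-") for x, y in zip(aligned_a.upper(), aligned_b.upper())]
    n = len(match)
    results: List[Tuple[int, float]] = []
    if window <= 0 or n < window:
        return results
    count = sum(match[:window])  # matches in the first window
    for start in range(n - window + 1):
        if start > 0:
            # Slide by one column: add the entering column, drop the leaving one.
            count += match[start + window - 1] - match[start - 1]
        identity = 100.0 * count / window
        if identity >= min_identity:
            results.append((start, identity))
    return results


# Hypothetical aligned pair: the first 20 columns are identical, the rest differ.
seq_a = "AAAAACCCCCGGGGGTTTTT" + "ACGTACGTAC"
seq_b = "AAAAACCCCCGGGGGTTTTT" + "TGCATGCATG"
segments = find_conserved_segments(seq_a, seq_b, window=20, min_identity=90.0)
```

Windows that pass the identity threshold could then be filtered further on the other parameters named above, such as Tm and composition.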

Phylogenetic programs, such as PHYLIP, ClustalW, and Pfam, can be used to compare a family of related sequences and to thereby derive a consensus sequence. The consensus sequence can be used as a probe. In some implementations, at ambiguous positions, a degenerate nucleotide is included.
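By way of a hedged illustration, a consensus sequence with optional degenerate positions can be derived from a set of aligned sequences as sketched below. The sketch does not reproduce the behavior of PHYLIP, ClustalW, or Pfam. The function name `consensus` is hypothetical, and the use of IUPAC ambiguity codes (e.g., R for A or G) to write a degenerate position is an assumption of the sketch; an atypical nucleotide such as inosine could be substituted at such positions instead.

```python
from collections import Counter
from typing import Sequence

# IUPAC nucleotide ambiguity codes, keyed by the set of bases they stand for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def consensus(aligned: Sequence[str], degenerate: bool = False) -> str:
    """Derive a consensus from equal-length aligned DNA sequences.

    degenerate=False: use the most common base in each column.
    degenerate=True: write an IUPAC code covering every base seen in the column.
    Columns with no A/C/G/T (e.g., all gaps) are written as 'N'.
    """
    seqs = [s.upper() for s in aligned]
    if not seqs or len({len(s) for s in seqs}) != 1:
        raise ValueError("need at least one sequence, all of the same length")
    out = []
    for column in zip(*seqs):
        bases = [b for b in column if b in "ACGT"]
        if not bases:
            out.append("N")
        elif degenerate:
            out.append(IUPAC[frozenset(bases)])
        else:
            out.append(Counter(bases).most_common(1)[0][0])
    return "".join(out)


# Hypothetical family of three aligned sequences differing at the last position.
family = ["ACGT", "ACGA", "ACGA"]
majority_probe = consensus(family)
degenerate_probe = consensus(family, degenerate=True)
```

Here the majority rule yields a single probe sequence, while the degenerate form describes a population of probes that differ at the ambiguous position.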

Probe Construction

Once a conserved segment is identified, a probe corresponding to that segment is constructed. Any of a variety of methods can be used to synthesize the probes. Such methods include chemical synthesis, photolithography, recombinant DNA techniques, and nucleic acid amplification.

In one embodiment, PCR is used to construct the probes. PCR primers are designed that hybridize to the sense and the anti-sense strands of nucleic acids from the first and/or second species. The primers are used to amplify the segment, e.g., by the polymerase chain reaction (PCR). The primers can include a primer pair to amplify a probe from the first species and another primer pair to amplify a probe from the second species. This approach can be extended to obtain orthologous probes from at least a third species. In some cases, the probe sequences are sufficiently identical that a single primer pair suffices. In other cases, probes are only obtained from one of the two species.

In one implementation, asymmetric PCR is used to generate largely single-stranded nucleic acid probes. In another implementation, one of the primers is tagged, e.g., with biotin. After amplification, the extended tagged primer is isolated from complementary nucleic acid strands to obtain a single-stranded nucleic acid probe.

The amplified nucleic acids are used as probes or formatted for such use. For example, the probes can be immobilized onto a planar substrate to produce a nucleic acid array. The probes can also be attached to a particle (such as a bead). Each probe of the collection can be attached to a different bead. In still another example, the probes are labeled for hybridization experiments. Further, the probes can be packaged in containers (collectively or individually) as a kit.

Arrays

An array of this invention can have many addresses on a substrate. The featured array can be configured in a variety of formats, non-limiting examples of which are described below.

A substrate can be opaque, translucent, or transparent. The addresses can be distributed on the substrate in one dimension, e.g., a linear array; in two dimensions, e.g., a planar array; or in three dimensions, e.g., a three-dimensional array. The solid substrate may be of any convenient shape or form, e.g., square, rectangular, ovoid, or circular. Non-limiting examples of two-dimensional array substrates include glass slides, quartz (e.g., UV-transparent quartz glass), single crystal silicon, wafers (e.g., silica or plastic), mass spectroscopy plates, metal-coated substrates (e.g., gold), membranes (e.g., nylon and nitrocellulose), plastics and polymers (e.g., polystyrene, polypropylene, polyvinylidene difluoride, polytetrafluoroethylene, polycarbonate, nylon, acrylic, and the like). Three-dimensional array substrates include porous matrices, e.g., gels or matrices. Potentially useful porous substrates include: agarose gels, acrylamide gels, sintered glass, dextran, meshed polymers (e.g., macroporous crosslinked dextran, SEPHACRYL™, and SEPHAROSE™), and so forth. Still other substrates include surfaces of microfluidic channels and devices, such as “Lab-On-A-Chip™” (Caliper Technologies Corp.).

An array can have a density of at least 10, 50, 100, 200, 500, 1,000, 2,000, 10⁴, 10⁵, 10⁶, 10⁷, 10⁸, or 10⁹ or more addresses per cm², and ranges between. In some embodiments, the plurality of addresses includes at least 100, 500, 1,000, or 5,000 addresses. In some other embodiments, the plurality of addresses includes fewer than 99, 499, 999, or 9,999 addresses. Addresses in addition to the addresses of the plurality can be disposed on the array. The center-to-center distance can be 5 mm, 1 mm, 100 μm, 10 μm, 1 μm or less. The longest diameter of each address can be 5 mm, 1 mm, 100 μm, 10 μm, 1 μm or less. Each address can contain 1.0 μg, 100 ng, 10 ng, 1 ng, 100 pg, 10 pg, 1 pg, 0.1 pg or less of a capture agent, i.e., the capture probe. For example, each address can contain 100, 10³, 10⁴, 10⁵, 10⁶, 10⁷, 10⁸, or 10⁹ or more molecules of the nucleic acid.

A nucleic acid array can be fabricated by a variety of methods, e.g., photolithographic methods (see, e.g., U.S. Pat. Nos. 5,143,854; 5,510,270; and 5,527,681), mechanical methods (e.g., directed-flow methods as described in U.S. Pat. No. 5,384,261), pin-based methods (e.g., as described in U.S. Pat. No. 5,288,514), and bead-based techniques (e.g., as described in PCT US/93/04145). A capture probe can be a single-stranded nucleic acid, a double-stranded nucleic acid (e.g., which is denatured prior to or during hybridization), or a nucleic acid having a single-stranded region and a double-stranded region. The capture probe can be selected by a variety of criteria, and can be designed by a computer program with optimization parameters. The capture probe can be selected to hybridize to a sequence-rich (e.g., non-homopolymeric) region of a nucleic acid. The Tm of the capture probe can be optimized by prudent selection of the complementarity region and length. Ideally, the Tm of all capture probes on the array is similar, e.g., within 20, 10, 5, 3, or 2° C. of one another. A database scan of available sequence information for a species can be used to determine potential cross-hybridization and specificity problems.
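The requirement that capture probes have similar Tm values can be checked computationally. The sketch below uses a common rough GC-content approximation for oligonucleotides of 14 nucleotides or longer, Tm ≈ 64.9 + 41 × (G + C − 16.4) / N; this formula is an assumption chosen for illustration (it ignores salt concentration, mismatches, and nearest-neighbor effects), and the function names and example sequences are hypothetical.

```python
from typing import Sequence


def approx_tm(seq: str) -> float:
    """Rough Tm estimate (deg C) from GC content; for oligos of 14 nt or longer."""
    seq = seq.upper()
    if len(seq) < 14 or set(seq) - set("ACGT"):
        raise ValueError("expected an A/C/G/T sequence of at least 14 nt")
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)


def tm_spread(probes: Sequence[str]) -> float:
    """Difference between the highest and lowest estimated Tm in a collection."""
    tms = [approx_tm(p) for p in probes]
    return max(tms) - min(tms)


def tms_are_similar(probes: Sequence[str], tolerance_c: float) -> bool:
    """True if all probes fall within tolerance_c degrees of one another."""
    return tm_spread(probes) <= tolerance_c


# Two made-up 20-mers with 10 and 12 G/C bases, respectively.
probes = ["ACGTACGTACGTACGTACGT", "GCGCACGTACGTACGTACGT"]
```

With these two sequences the estimated spread is about 4.1° C., so the pair would pass a 5° C. tolerance but fail a 3° C. tolerance.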

Evaluating a Sample

The collections of probes described herein can be used to evaluate a sample, particularly a sample that includes a nucleic acid of a third species that differs from the species used to construct the probes. The evaluation can generate information about the abundance of different nucleic acids in the sample.

For example, if the sample is mRNA or cDNA from a cell or tissue, the information can indicate the level of expression of different genes in the cell or the cells of the tissue.

First, RNA is prepared from the sample, e.g., using routine methods. RNA isolation can include DNase treatment to remove genomic DNA and hybridization to an oligo-dT coupled to a solid substrate (e.g., as described in Current Protocols in Molecular Biology, John Wiley & Sons, N.Y.). The oligo-dT solid substrate is washed and the RNA is eluted. The RNA is then reverse transcribed and, optionally, amplified.

Typically, the sample is directly or indirectly labeled. The amplified and/or reverse-transcribed nucleic acid can be labeled, e.g., by the incorporation of a labeled nucleotide. Examples of labels include fluorescent labels, e.g., red-fluorescent dye Cy5 (Amersham) or green-fluorescent dye Cy3 (Amersham), chemiluminescent labels, e.g., as described in U.S. Pat. No. 4,277,437, and colorimetric labels. Alternatively, the amplified nucleic acid can be labeled with biotin and detected after hybridization with labeled streptavidin, e.g., streptavidin-phycoerythrin (Molecular Probes).

The labeled nucleic acid is then hybridized to the cross-species array. In addition, a control nucleic acid or a reference nucleic acid can be contacted to the same array. The control nucleic acid or reference nucleic acid can be labeled with a label that differs from that of the sample nucleic acid, e.g., one with a different emission maximum.

Labeled nucleic acids are contacted to an array under judiciously chosen hybridization conditions. Some exemplary specific hybridization conditions include: (i) low stringency hybridization conditions in 6× sodium chloride/sodium citrate (SSC) at about 45° C., followed by two washes in 0.2×SSC, 0.1% SDS at least at 50° C. (the temperature of the washes can be increased to 55° C. for low stringency conditions); (ii) medium stringency hybridization conditions in 6×SSC at about 45° C., followed by one or more washes in 0.2×SSC, 0.1% SDS at 60° C.; (iii) high stringency hybridization conditions in 6×SSC at about 45° C., followed by one or more washes in 0.2×SSC, 0.1% SDS at 65° C.; and (iv) very high stringency hybridization conditions in 0.5 M sodium phosphate, 7% SDS at 65° C., followed by one or more washes in 0.2×SSC, 1% SDS at 65° C. Additional guidance for performing hybridization reactions can be found in Current Protocols in Molecular Biology, John Wiley & Sons, N.Y. (1989), 6.3.1-6.3.6. Aqueous and nonaqueous hybridization methods are described in that reference and either can be used.

After washing, the array is evaluated to determine the amount of label at each address. Detection can be by image acquisition or other methods. For example, unlabeled hybridized strands can be directly detected using surface plasmon resonance or a change in electrical conductance.

In one implementation, low stringency washing conditions are used initially. A first set of data is acquired to determine the amount of probe bound after the low stringency wash. Then the array is washed using higher stringency conditions, e.g., medium stringency. A second set of data is acquired to determine the amount of probe remaining. This process can be repeated to accrue a series of data sets, each indicating the amount of probe remaining at each address of the array. In cases where a complete series of data is acquired, the series can be analyzed, e.g., using computer software, to estimate homology between the nucleic acid in a sample and the probe.
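The text does not specify how the wash series is analyzed. As one hedged possibility, the signal at an address can be expressed as the fraction retained relative to the first (low stringency) reading, and the wash step at which that fraction first falls below a threshold can serve as a crude proxy for how well the sample nucleic acid matches the probe. The function names, the 0.5 threshold, and the example readings below are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def retained_fractions(signals: Sequence[float]) -> List[float]:
    """Signal after each wash, relative to the first (lowest stringency) reading."""
    if not signals or signals[0] <= 0:
        raise ValueError("need a positive signal after the first wash")
    return [s / signals[0] for s in signals]


def first_dropout(signals: Sequence[float], threshold: float = 0.5) -> Optional[int]:
    """Index of the first wash at which the retained fraction falls below threshold.

    Returns None if the signal never drops below the threshold, which would
    suggest a well-matched duplex across the tested stringencies.
    """
    for step, fraction in enumerate(retained_fractions(signals)):
        if fraction < threshold:
            return step
    return None


# Made-up readings at one address after low, medium, high, and very high
# stringency washes.
step = first_dropout([100.0, 90.0, 40.0, 5.0])
```

An address whose signal survives to a later wash step would be read as carrying a closer match than one that drops out early.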

A hybridization profile of a sample can be determined based on the extent of hybridization at different addresses of the array. A “hybridization profile” includes a plurality of values, wherein each value corresponds to the level of hybridization of a sample or an amplified sequence to a probe. The value can be a qualitative or quantitative assessment of the level of hybridization.

The profile can be used to characterize the sample. For example, the profile can be compared to a standard profile, e.g., the hybridization profile of a well-studied cell population from a particular species.

In one embodiment, the extent of hybridization at an address is represented by a numerical value and stored, e.g., in a vector, a one-dimensional matrix, or one-dimensional array. The vector x = {x_a, x_b, . . . } has a value for each address of the array. For example, a numerical value for the extent of hybridization at a first address is stored in the variable x_a. The numerical value can be adjusted, e.g., for local background levels, sample amount, and other variations. Nucleic acid is also prepared from a reference sample and hybridized to an array (e.g., the same or a different array), e.g., with multiple addresses. The vector y is constructed identically to vector x. The sample hybridization profile and the reference profile can be compared, e.g., using a mathematical equation that is a function of the two vectors. The comparison can be evaluated as a scalar value, e.g., a score representing similarity of the two profiles. Either or both vectors can be transformed by a matrix in order to add weighting values to different nucleic acids detected by the array.
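A minimal sketch of such a comparison follows. It assumes background subtraction as the adjustment, a diagonal weighting (one weight per address) as the matrix transform, and cosine similarity as the scalar score; none of these specific choices is prescribed by the text, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def adjust(raw: Sequence[float], background: Sequence[float]) -> List[float]:
    """Subtract local background at each address, clamping at zero."""
    if len(raw) != len(background):
        raise ValueError("raw and background must have one value per address")
    return [max(r - b, 0.0) for r, b in zip(raw, background)]


def weighted_cosine(x: Sequence[float], y: Sequence[float],
                    weights: Sequence[float]) -> float:
    """Cosine similarity of two profiles after per-address weighting.

    Multiplying each address by its weight is equivalent to transforming
    both vectors by a diagonal weighting matrix.
    """
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y, and weights must have the same length")
    wx = [w * a for w, a in zip(weights, x)]
    wy = [w * b for w, b in zip(weights, y)]
    norm = math.sqrt(sum(a * a for a in wx)) * math.sqrt(sum(b * b for b in wy))
    if norm == 0:
        raise ValueError("a weighted profile is all zeros")
    return sum(a * b for a, b in zip(wx, wy)) / norm


# Made-up sample and reference profiles over three addresses.
score = weighted_cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 1.0, 1.0])
```

A score near 1 indicates that the sample and reference profiles have the same shape; setting a weight to zero removes that address from the comparison.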

In one particular embodiment, the profiles are processed to account for differential hybridization to two probes for orthologous genes from different species. In cases where the sample nucleic acid includes nucleic acid from a third species that differs from the species used to design the probes, an algorithm can be used to determine if nucleic acid in the sample is hybridizing to similar extents to both related probes. A correlation algorithm can be used to quantitatively determine if the two (or more) related probes are detecting a similar signal. The algorithm can also account for phylogenetic proximity of the third species to one of the two species used to design the probes relative to the other. Once a favorable comparison is made, the profile can be reconfigured to account for the inferred abundance of the nucleic acid of the third species based on hybridization to the two orthologous probes. Thus, the use of two probes, one from each ortholog, can be used to improve the quality and reliability of hybridization profiles relative to a single probe.
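The processing described above can be sketched as follows, assuming the Pearson correlation coefficient as the correlation algorithm and a weighted average as the way to combine paired signals. The weight given to the first-species probe stands in for phylogenetic proximity; that weight, the 0.8 correlation threshold, and the function names are illustrative assumptions.

```python
import math
from typing import List, Optional, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length signal lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant signal")
    return cov / math.sqrt(var_x * var_y)


def combine_if_concordant(first: Sequence[float], second: Sequence[float],
                          weight_first: float = 0.5,
                          min_r: float = 0.8) -> Optional[List[float]]:
    """Merge signals from paired orthologous probes when they agree.

    first[i] and second[i] are the signals for gene i from the probe derived
    from the first and second species. Returns None when the two probe sets
    are not sufficiently correlated.
    """
    if pearson(first, second) < min_r:
        return None
    return [weight_first * a + (1.0 - weight_first) * b
            for a, b in zip(first, second)]


# Made-up signals for three genes; the sample species is assumed to be
# closer to the second species, so its probes get the larger weight.
merged = combine_if_concordant([10.0, 20.0, 30.0], [30.0, 40.0, 50.0],
                               weight_first=0.25)
```

When the paired probes disagree, the function declines to merge them, which leaves the decision about that profile to a later step.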

Profile data can be stored in a database, e.g., a relational database such as a SQL database (e.g., Oracle or Sybase database environments). The database can have multiple tables. For example, raw hybridization data can be stored in one table, wherein each column corresponds to a nucleic acid being assayed, e.g., an address of an array, and each row corresponds to a sample. A separate table can store identifiers and sample information, e.g., the batch number of the array used, date, and other quality control information.
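The two-table layout described above can be sketched with Python's built-in sqlite3 module, used here only as a stand-in for the Oracle or Sybase environments named in the text. The table names, column names, and inserted values are hypothetical and serve only to show one row per sample and one column per address.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a sample table and a raw-data table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE samples (
            sample_id   INTEGER PRIMARY KEY,
            array_batch TEXT NOT NULL,   -- batch number of the array used
            run_date    TEXT NOT NULL    -- date of the hybridization
        );
        CREATE TABLE raw_hybridization (
            sample_id INTEGER PRIMARY KEY REFERENCES samples(sample_id),
            addr_1    REAL,              -- one column per array address
            addr_2    REAL,
            addr_3    REAL
        );
        """
    )
    # Made-up values, for illustration only.
    conn.execute("INSERT INTO samples VALUES (1, 'batch-A', '2003-01-01')")
    conn.execute("INSERT INTO raw_hybridization VALUES (1, 812.5, 40.0, 377.25)")
    conn.commit()
    return conn


conn = build_demo_db()
row = conn.execute(
    "SELECT s.array_batch, r.addr_1, r.addr_3 "
    "FROM raw_hybridization AS r JOIN samples AS s USING (sample_id)"
).fetchone()
```

Keeping quality control information in its own table lets the raw signal table stay purely numeric, with the shared `sample_id` linking the two.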

Nucleic acids that are present at similar levels can be identified by clustering data. Nucleic acids can be clustered using hierarchical clustering (see, e.g., Sokal and Michener (1958) Univ. Kans. Sci. Bull. 38: 1409), Bayesian clustering, k-means clustering, and self-organizing maps (see, Tamayo et al. (1999) Proc. Natl. Acad. Sci. USA 96: 2907).

In one particular embodiment, the hybridization profiles represent the expression of genes in a cell. The profiles from such a nucleic acid expression analysis are used to compare samples and/or cells in a variety of states, e.g., as described in Golub et al. ((1999) Science 286: 531). For example, multiple expression profiles from different conditions and including replicates or like samples from similar conditions are compared to identify nucleic acids whose expression level is predictive of the sample and/or condition. Each candidate nucleic acid can be given a weighted “voting” factor dependent on the degree of correlation of the nucleic acid's expression and the sample identity. A correlation can be measured using a Euclidean distance or the Pearson correlation coefficient.

The similarity of a sample expression profile to a predictor expression profile (e.g., a reference expression profile that has associated weighting factors for each nucleic acid) can then be determined, e.g., by comparing the log of the expression level of the sample to the log of the predictor or reference expression value and adjusting the comparison by the weighting factor for all nucleic acids of predictive value in the profile.
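The weighted voting and log comparison described in the two preceding paragraphs can be sketched as below. The weight used here, the difference of class means divided by the sum of class standard deviations (computed on log expression), is loosely modeled on the signal-to-noise measure associated with Golub et al.; treating it as the weighting factor is an assumption, as are the function names, the gene labels, and the training values.

```python
import math
from statistics import mean, stdev
from typing import Dict, Sequence, Tuple


def gene_weight_and_midpoint(class_a: Sequence[float],
                             class_b: Sequence[float]) -> Tuple[float, float]:
    """Weight and decision midpoint for one gene, from log expression levels.

    Each class needs at least two training values so that a standard
    deviation can be computed.
    """
    log_a = [math.log(v) for v in class_a]
    log_b = [math.log(v) for v in class_b]
    mu_a, mu_b = mean(log_a), mean(log_b)
    spread = stdev(log_a) + stdev(log_b)
    weight = (mu_a - mu_b) / spread if spread > 0 else 0.0
    return weight, (mu_a + mu_b) / 2.0


def vote_score(sample: Dict[str, float],
               training_a: Dict[str, Sequence[float]],
               training_b: Dict[str, Sequence[float]]) -> float:
    """Sum of weighted votes; positive favors class A, negative favors class B."""
    total = 0.0
    for gene, level in sample.items():
        weight, midpoint = gene_weight_and_midpoint(training_a[gene],
                                                    training_b[gene])
        total += weight * (math.log(level) - midpoint)
    return total


# Made-up training data: g1 is high in class A, g2 is high in class B.
training_a = {"g1": [100.0, 120.0, 110.0], "g2": [10.0, 12.0, 11.0]}
training_b = {"g1": [10.0, 12.0, 11.0], "g2": [100.0, 120.0, 110.0]}
score_a_like = vote_score({"g1": 105.0, "g2": 11.0}, training_a, training_b)
score_b_like = vote_score({"g1": 11.0, "g2": 105.0}, training_a, training_b)
```

Genes whose expression separates the two classes cleanly receive large weights and therefore dominate the vote, while uninformative genes contribute little.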

An array of human disease-related genes can be used for analysis of differential gene expression. The human disease-related genes can be identified in both human and Drosophila, as described above. Monitoring the differential expression of a nucleic acid in response to cancer-related proteins in a developing organism, e.g., developing Drosophila, can provide insight into the role of the cancer-related proteins in mammals.

The specific example below is to be construed as merely illustrative, and not limitative of the remainder of the disclosure in any way whatsoever. Without further elaboration, it is believed that one skilled in the art can, based on the description herein, utilize the present invention to its fullest extent. All publications recited herein are hereby incorporated by reference in their entirety.

EXAMPLE

Identification of Orthologs

A nucleic acid array that includes probes to orthologs in both human and Drosophila was constructed. First, a computer program was used to identify nucleic acid probes specific for orthologs that were conserved despite the evolutionary distance between these two species. The BLAST (Basic Local Alignment Search Tool, optimized for the x86 LINUX SMP architecture) algorithm was iteratively used to query a non-redundant database (“nr”, which is a database that combines SWISSPROT, TREMBL, and PIR 2001.3) with each predicted translated sequence from the Genome Annotation Database of Drosophila (version 2, “gadfly2”). The program was tailored to output sequences of about 150 amino acids in length that aligned to the query sequence with an E-value below 1.00E−20. The results of this query process indicate that 51% of predicted gene products in Drosophila have at least 30% homology with human proteins. The results are summarized in Table 1 and Table 2.

TABLE 1 Summary of human orthologs for Drosophila genes based on E-value.

E value      Number of Human Orthologs     Total = 14333 (%)
             (Unique human genes)
1.00E−180    648 (529)                     4.52
1.00E−150    982 (836)                     6.85
1.00E−120    1459 (1295)                   10.18
1.00E−100    1902 (1719)                   13.27
1.00E−80     2545 (2042)                   17.75
1.00E−60     3448 (2704)                   24.05
1.00E−40     4679 (3547)                   32.64
1.00E−20     6555 (4687)                   45.72
1.00E−10     7510 (5258)                   52.38

TABLE 2 Summary of human orthologs for Drosophila genes based on percent identity.

Protein Identity (%)    Number of Human Orthologs    %
90%                     69                           0.48
80%                     224                          1.56
70%                     597                          4.16
60%                     1236                         8.62
50%                     2372                         16.54
40%                     4308                         30.05
30%                     7428                         51.81

Construction of a Human-Fly Evolutionarily Conserved Gene Microarray

For each conserved pair of sequences, one Drosophila and one human, two sets of primers (20 to 25-mers) were designed. One set was used to amplify the Drosophila sequence of the conserved pair. The other set was used to amplify the human sequence of the conserved pair. Each primer set was used in a PCR amplification reaction with appropriate template cDNA or genomic DNA to amplify the Drosophila or human sequence. The amplified segment was generally selected to include an approximately 150 basepair region that was most conserved. After amplification, each amplified segment was spotted onto a coated glass plate to form an array. Each amplified segment from Drosophila was spotted adjacent to the corresponding amplified segment from its human ortholog. A series of different concentrations of positive and negative control probes were also spotted onto the array for normalization purposes. The results are summarized in Table 3.

TABLE 3 The accumulated percentage of conserved coding sequences in highly identical orthologs* and gadfly2.

Identity of         Ortholog    Percentage in highly       Percentage in
Conserved Region    number      identical orthologs (%)    gadfly2 (%)
90%                 10          0.18                       0.07
80%                 333         5.96                       2.32
70%                 2029        36.31                      14.11
60%                 5588        100                        38.87

*The highly identical orthologs were based on length of orthologs greater than 150 bp.

Probes Containing Degenerate Positions

Oligonucleotide primer sets (see Table 4 below) were used to amplify a 5′UTR region containing 144 basepairs in various serotypes of Picornaviridae and a VP1 region containing 478 basepairs in Enterovirus 71. The regions were labeled with Cy5-dUTP after PCR amplification. Two degenerated probes (i.e., a 5′UTR degenerated probe and a VP1 degenerated probe) were prepared, and their sequences are listed in Table 5. The degenerated probes were hybridized to amplified samples from four different viruses, i.e., Enterovirus 71, Coxsackievirus A16, Echovirus 30, and Influenza A virus. The results show that the 5′UTR degenerated probe, which was designed to recognize all Enteroviruses, was specifically labeled after hybridization with amplified samples from Enterovirus 71, Coxsackievirus A16, and Echovirus 30, but not with the sample from Influenza A virus. In addition, the VP1 degenerated probe, designed to differentiate Enterovirus 71 from other Enterovirus serotypes, was tested. As expected, the results showed that the VP1 degenerated probe was specifically labeled after hybridization with the amplified sample from Enterovirus 71, but not with the sample from Coxsackievirus A16. Thus, a degenerate probe can be designed to specifically detect different serotypes or a single serotype of Enterovirus in a microarray analysis.

Enterovirus 71 is the major epidemic pathogen of hand, foot and mouth disease in pan-Pacific countries. Because the enterovirus has multiple serotypes and genogroups, sequences vary considerably between different serotypes or strains of the same species. Thus, the design of oligo probes specific for the diagnosis of Enterovirus 71 becomes complicated. Using the orthologous probe concept, sequences from multiple Picornavirus species were aligned to design pan-Picornavirus probes. In addition, probes specific to Enterovirus 71 that correspond to the aligned sequences of multiple Enterovirus 71 strains were designed. As described above, the probes carried multiple degenerated sites in their sequences to represent the collection of the target viruses. The binding efficacy and specificity of these orthologous probes to the Enterovirus 71 targets were also shown above.

TABLE 4 Specific oligonucleotide primer sets for PCR amplification of Picornaviridae.

Region¹   Specific Serotypes²       Primer     Sequence                                  Length (bp)   Amplicon (bp)
5′-UTR    Most of Picornaviruses    5-UTR-s    CCCCTGAATGCGG (SEQ ID NO. 1)              13            144
                                    5-UTR-a    GTCACCATAAGCAGCCA (SEQ ID NO. 2)          17
VP1       Most of EV71              VP1-s      GAGAGTATGATTGA (SEQ ID NO. 3)             14            478
                                    VP1-a      GGTCTTTCTCCTGTTTGTGTTC (SEQ ID NO. 4)     22

¹Abbreviation: 5′-UTR is 5′-terminal untranslated region; VP is viral genome protein.
²Abbreviation: EV71, Enterovirus 71; CVA16, Coxsackievirus A16.

TABLE 5 Degenerated probes used to detect virus targets in the sample.

Probe¹    Specific Serotypes        mer   Degenerated Site   Sequences²
5′-UTR    Most of Picornaviruses    62    7                  TGTCGTAAYGSGCAASTCYGYRGCGGAACCG
                                                             ACTACTTTGGGTGTCCGTGTTTCMTTTTATT (SEQ ID NO. 5)
VP1-s     Most of EV71              73    8                  TCACCYGCGAGCGCYTAYCARTGGTTTTAYG
                                                             ACGGGTAYCCCACRTTYGGTGAACACAAACAGGAGAAAGACC (SEQ ID NO. 6)

¹Abbreviation: 5′-UTR is 5′-terminal untranslated region; VP is viral genome protein; PC is positive control; NC is negative control.
²Symbols for nucleotide: M (amino) is A or C; R (purine) is G or A; S (strong) is G or C; W (weak) is A or T; Y (pyrimidine) is T or C.
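The degenerate symbols in Table 5 can be interpreted programmatically. The sketch below counts the degenerated sites in each probe, computes how many concrete sequences each probe represents, and tests whether a given target matches a degenerate pattern. It assumes that the two sequence fragments shown for each probe in Table 5 are joined in the order listed; the symbols K, B, D, H, V, and N come from the standard IUPAC code rather than from the table footnote, and the function names are hypothetical.

```python
# IUPAC nucleotide codes: each symbol maps to the bases it can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "S": "CG", "W": "AT", "Y": "CT", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degenerate_sites(probe: str) -> int:
    """Number of positions that stand for more than one base."""
    return sum(1 for symbol in probe if len(IUPAC[symbol]) > 1)


def variant_count(probe: str) -> int:
    """Number of distinct concrete sequences represented by the probe."""
    count = 1
    for symbol in probe:
        count *= len(IUPAC[symbol])
    return count


def matches(probe: str, target: str) -> bool:
    """True if target is one of the sequences the degenerate probe represents."""
    return len(probe) == len(target) and all(
        base in IUPAC[symbol] for symbol, base in zip(probe, target)
    )


# Probe sequences from Table 5, with the two listed fragments concatenated.
UTR_PROBE = ("TGTCGTAAYGSGCAASTCYGYRGCGGAACCG"
             "ACTACTTTGGGTGTCCGTGTTTCMTTTTATT")          # SEQ ID NO. 5
VP1_PROBE = ("TCACCYGCGAGCGCYTAYCARTGGTTTTAYG"
             "ACGGGTAYCCCACRTTYGGTGAACACAAACAGGAGAAAGACC")  # SEQ ID NO. 6
```

Under that joining assumption, the computed lengths and site counts agree with the "mer" and "Degenerated Site" columns of Table 5, and every degenerate position in these two probes is two-fold.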

Other Embodiments

All of the features disclosed in this specification may be combined in any combination. Each feature disclosed in this specification may be replaced by an alternative feature serving the same, equivalent, or similar purpose. Thus, unless expressly stated otherwise, each feature disclosed is only an example of a generic series of equivalent or similar features.

From the above description, one skilled in the art can easily ascertain the essential characteristics of the present invention, and without departing from the spirit and scope thereof, can make various changes and modifications of the invention to adapt it to various usages and conditions. Thus, other embodiments are also within the claims.

1-22. (canceled)
23. A method for evaluating a sample, the method comprising: contacting a sample that comprises nucleic acids from a third species to a collection of nucleic acid probes, wherein (i) each probe comprises a segment, the entirety of which hybridizes under low stringency conditions to at least a first gene of a first species and a second gene of a second species, wherein the hybridizing probes correspond to different genes of the first or second species, and the first and the second genes are orthologous to each other, and (ii) the third species differs from both the first and the second species; evaluating binding of the sample to each of the probes; and for each probe that is bound, inferring the presence, in the sample, of a nucleic acid for a third gene of the third species, the third gene being an ortholog of the first and second gene.
24. The method of claim 23, wherein the first and second species are of different genera.
25. The method of claim 23, wherein the first and second species are of different families.
26. The method of claim 23, wherein the first and second species are of different classes.
27. The method of claim 23, wherein the first and second species are of different phyla.
28. The method of claim 23, wherein the first and second species are of different kingdoms.
29-40. (canceled)